Abstract

We describe a diagnosis algorithm entered into 
the Second International Diagnostic Competition. 

We focus on the first diagnostic problem of the in- 
dustrial track of the competition in which a diag- 
nosis algorithm must detect, isolate, and identify 
faults in an electrical power distribution testbed 
and provide corresponding recovery recommen- 
dations. The diagnosis algorithm embodies a 
model-based approach, centered around quali- 
tative event-based fault isolation. Faults pro- 
duce deviations in measured values from model- 
predicted values. The sequence of these de- 
viations is matched to those predicted by the 
model in order to isolate faults. We augment 
this approach with model-based fault identification,
which determines fault parameters and helps
to further isolate faults. We describe the diag- 
nosis approach, provide diagnosis results from 
running the algorithm on provided example sce- 
narios, and discuss the issues faced, and lessons 
learned, from implementing the approach. 

1 INTRODUCTION 

Timely and robust detection, isolation, and identifica- 
tion of faults in engineering systems lies at the core of 
systems health management technologies. This paper 
presents a model-based, qualitative, event-based fault 
diagnosis scheme that was entered into the Second In- 
ternational Diagnostic Competition (DXC’10) (Poll et 
al., 2010). The competition allows for a compara- 
tive study of different diagnostic approaches, and in- 
cludes multiple diagnostic problems. We focus on di- 
agnostic problem I (DPI) of the industrial track of the 
competition, which consists of fault diagnosis and re- 
covery for a subset of the Advanced Diagnosis and 
Prognosis Testbed (ADAPT) (Poll et al., 2007), called 
ADAPT-Lite. ADAPT is an electrical power distribution
system, representative of those found in spacecraft.
Our entry into DXC’10 is called QED, for qualitative
event-based diagnosis.

QED extends the TRANSCEND diagnosis scheme 
described in (Mosterman and Biswas, 1999; Daigle et
al., 2010). In this scheme, fault isolation is achieved
through analysis of the transients produced by faults, 
manifesting as deviations in observed behavior from 
predicted nominal behavior. We incorporate addi- 
tional diagnostic information, known as relative mea- 
surement orderings, which provide a partial order- 
ing of measurement deviations for different faults, 
leading to an enhanced event-based fault isolation 
scheme (Daigle et al., 2007; 2009). DPI requires fault 
identification, and includes abrupt, incipient, and in- 
termittent fault profiles. TRANSCEND deals only with 
abrupt profiles, so we incorporate extensions for incip- 
ient faults (Roychoudhury, 2009), and new work for 
identification of intermittent faults. 

The paper is organized as follows. Section 2 
overviews the diagnosis approach. Section 3 provides 
the system model. Section 4 describes fault detection 
and symbol generation. Section 5 discusses fault isola- 
tion, and Section 6 describes fault identification. Sec- 
tion 7 describes fault recovery. Section 8 presents di- 
agnosis results, and Section 9 concludes the paper. 

2 DIAGNOSIS APPROACH 

We focus on Diagnostic Problem I (DPI) of the industrial
track of DXC’10. The problem here is to decide
whether the mission should be aborted or continued. In 
order to make this decision, we must determine if the 
system is faulty, and if the fault warrants an abort rec- 
ommendation or not, given the system observations. 

The diagnosis architecture is shown in Fig. 1. The
system receives inputs u(t) and produces outputs y(t).
Due to the simplicity of the monitored system, we use
a predictive model instead of an observer. Our system
model runs simultaneously, producing predicted
outputs ŷ(t), given the inputs u(t). Using statistical
methods, the fault detection module decides when
a measurement has deviated from its nominal value,
triggering fault isolation. Measurement deviations,
viewed as events, are abstracted into a symbolic
representation using the symbol generator. The sequence
of these symbols, where a symbol is denoted by σ, is
used to isolate faults F. Fault isolation consists of
candidate generation at the point of fault detection, and
hypothesis refinement as new symbols are provided.

Figure 1: Diagnosis and recovery architecture.

[Figure 2: ADAPT-Lite schematic. Recoverable labels: E240, E242, E265, IT267, FAN416, ST516.]

Each fault f ∈ F is associated with a component, its
fault mode, and its fault parameters. Fault identification
computes, for each fault f ∈ F, the values of the
fault parameters. An oracle is provided by DXC’10
that decides for each fault f whether an abort is
recommended, producing a set of recommendations R.
The decision module selects a recommendation from
R and outputs the associated control actions C.

3 SYSTEM MODELING 

Our diagnosis approach is model-based, requiring a 
model of both nominal and faulty behavior. It is used 
for prediction of nominal values and within the fault 
detection, isolation, and identification modules. In 
the following, we describe the models of nominal and 
faulty behavior of the ADAPT-Lite system. 


3.1 Nominal Model

The schematic of ADAPT-Lite is given in Fig. 2. Sensors
are denoted in italics. Sensors prefixed with an E
are voltage sensors, those with an IT are current sensors,
and those with ISH or ESH are for states of circuit
breakers and relays. TE228 is the battery temperature
sensor, and ST516 is the fan speed sensor. Note that
the inverter converts DC power to AC, and E265 and
IT267 provide rms values of the AC waveforms. We
describe models for each of the components in turn.

The battery consists of two 12 V lead-acid batteries
in series. We lump these together into a single
battery model. Battery models typically must include
a set of complex nonlinear behaviors (Ceraolo, 2000;
Daigle et al., 2009). However, most of these characteristics
are not evident within the short, four-minute time
frame of the experimental scenarios. Therefore, we
utilize a simplified electrical circuit equivalent model,
consisting of a single large capacitance, C_0, in series
with a capacitor-resistor pair, C_s and R_s, that subtracts
from the voltage provided by C_0 (see Fig. 2). In reality,
R_s is a function of state of charge, depth of charge,
and temperature, but, for our purposes, we may assume
it to be constant. C_s is much smaller than C_0. Since
the battery voltage decreases faster at lower voltages,
we express C_0 as a function that decreases with voltage.
The battery also has a large parasitic resistance in
parallel that accounts for the self-discharge of the battery
due to various parasitic processes, which may be
omitted here. The battery may then be described as

\[ \dot{v}_0 = -\frac{1}{C_0} i_B, \]

\[ \dot{v}_s = \frac{1}{C_s} \left( i_B - \frac{v_s}{R_s} \right), \]

\[ v_B = v_0 - v_s, \]

where v_B is the battery voltage, i_B is the battery current,
v_0 is the voltage across C_0, and v_s is the voltage
drop across the C_s-R_s pair. The battery temperature is
unchanging over the scenario length so we express it
as a constant T_B. Deviation from T_B implies a fault.
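
To make the battery model concrete, the following sketch steps the two state equations forward with a simple forward-Euler scheme. It is only an illustration under stated assumptions: C_0 is held constant (the model above makes it a function of voltage), positive i_B is taken to discharge the battery, and the function name and all parameter values are hypothetical rather than identified values for ADAPT-Lite.

```python
def simulate_battery(i_B, v0_init, C0, Cs, Rs, dt):
    """Forward-Euler sketch of the simplified battery model.

    State equations assumed here:
        dv_0/dt = -i_B / C_0
        dv_s/dt = (i_B - v_s / R_s) / C_s
        v_B     = v_0 - v_s
    i_B is a sequence of current samples spaced dt seconds apart.
    Returns the list of battery voltages v_B, one per sample.
    """
    v0, vs = v0_init, 0.0
    v_B = []
    for i in i_B:
        v0 += dt * (-i / C0)            # slow discharge of the large capacitance
        vs += dt * (i - vs / Rs) / Cs   # fast C_s-R_s transient
        v_B.append(v0 - vs)
    return v_B


# Hypothetical values: 2 A constant load for 10 s, sampled every 10 ms.
voltages = simulate_battery([2.0] * 1000, v0_init=24.0,
                            C0=1e5, Cs=10.0, Rs=0.1, dt=0.01)
```

With a constant current, v_s settles toward i_B R_s, which is the steady-state relation used later when v_0 is estimated from v_B and i_B.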

The inverter transforms DC power to AC power.
When operating nominally, the voltage v_rms is controlled
very close to 120 V AC. From a power balance
of the AC and DC sides of the inverter, we have
v_inv i_inv = e · v_rms i_rms, where e is the inverter
efficiency, v_inv is the input DC voltage to the inverter,
and i_inv is the input DC current to the inverter. The
inverter still draws a small amount of current even when
i_rms = 0, and this is captured as a DC resistance
parallel to the inverter, R_on. We have

Figure 3: Fault profiles. (a) Offset. (b) Drift.


The DC and AC resistive loads are modeled as pure
resistances, with R_dc and R_ac, respectively. The fan
has both resistive and inductive properties, so it introduces
a phase difference in its current from the input
voltage. We express its equivalent impedance as Z_fan
and phase offset as φ. The inverter current i_rms is the
vector sum of the AC load and fan currents. Nominally,
the fan is always on, so we can express its speed
as a constant ω. Any deviation from ω implies a fault.

Relays and circuit breakers are modeled as ideal 
switches. For DPI, there are no mode changes dur- 
ing nominal operation, so observed mode changes are 
directly attributed to faults. 

Using the available data sets, we have identified the 
parameters of our model and obtained very accurate 
descriptions of nominal behavior. 

3.2 Fault Modeling 

We consider both parametric faults, defined as unexpected
changes in system parameter values, and discrete
faults, defined as unexpected changes in the operating
mode of a component. Parametric faults include
changes in the AC and DC resistances, R_ac and R_dc,
and additive terms to sensor equations. These parameters
may assume offset, drift, and intermittent offset
profiles, as shown in Fig. 3 (t_f denotes the time of fault
occurrence).

For an offset, the faulty value p_f(t) is defined by

\[ p_f(t) = p(t) + \Delta p, \]

where p(t) is the nominal value, and Δp is the offset.
A drift is defined by its slope m, i.e.,

\[ p_f(t) = p(t) + m (t - t_f). \]

For intermittent offsets, the offset alternates between
zero and a nonzero value. The profile is defined
by three parameters: the mean offset μ_Δp, i.e.,
mean(Δp_1, Δp_2, ...), the mean faulty time μ_f, i.e.,
mean(Δt_f1, Δt_f2, ...), and the mean time it is nominal
μ_n, i.e., mean(Δt_n1, Δt_n2, ...).

Discrete faults include stuck faults of the relays and 
circuit breakers, inverter failure, load failure, fan over- 
speed and underspeed faults, the introduction of a bat- 
tery parasitic resistance R p , and sensor stuck faults. 
Note that sensor stuck faults are defined as y(t) = c, 
where c is a constant, and sensor noise is absent. 

4 FAULT DETECTION AND SYMBOL 
GENERATION 

Each sensor is assigned a fault detector and a symbol
generator. For each sensor output y(t), we define
the residual as r(t) = y(t) - ŷ(t), where ŷ(t) is
the model-predicted output signal. Statistically significant
nonzero residual signals indicate faults. Following
fault detection for a sensor, its symbol generator
is initiated to calculate magnitude, slope, and discrete
change symbols that are used for fault isolation.


4.1 Fault Detection 

We use the Z-test for robust fault detection using a set
of sliding windows (Daigle et al., 2010). A small window,
W_2, is used to estimate the current mean μ_r(t) of
a residual signal, taken as the average of r(i) over the
most recent W_2 samples. The variance of the nominal
residual signal, σ_r²(t), is computed using a large window
W_1 preceding W_2 by a buffer W_delay, which ensures that
W_1 does not contain any samples after fault occurrence.
The variance is computed about the mean of r(i) over W_1.

A pre-specified confidence level determines the
bounds z⁻ < 0 and z⁺ > 0 for a two-sided Z-test.
The fault detection thresholds, ε⁻(t) and ε⁺(t), are
dynamically computed from these bounds, the estimated
variance, and a modeling error term E. A fault is detected
if μ_r(t) lies outside of the thresholds at time t. The
parameters W_1, W_2, W_delay, the z bounds, and E must
be tuned to optimize performance.
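
A minimal sketch of such a sliding-window detector is given below. It is not the competition implementation: it assumes symmetric bounds (z⁺ = -z⁻ = z), assumes the usual Z-test scaling of the standard deviation by the square root of W_2, and adds the modeling error term E to widen the thresholds. The function name, argument names, and default values are hypothetical.

```python
import math
from statistics import mean, pvariance


def ztest_detect(residual, t, W1, W2, W_delay, z=2.58, E=0.0):
    """Return '+', '-', or None for the residual at sample index t.

    W2 samples ending at t estimate the current mean; W1 samples that
    end W_delay samples before the W2 window estimate nominal variance.
    """
    start2 = t - W2 + 1          # first index of the small window
    end1 = start2 - W_delay      # exclusive end of the large window
    start1 = end1 - W1           # first index of the large window
    if start1 < 0:
        return None              # not enough history yet

    mu = mean(residual[start2:t + 1])
    sigma = math.sqrt(pvariance(residual[start1:end1]))

    # Assumed threshold form: z * sigma / sqrt(W2), widened by E.
    eps = z * sigma / math.sqrt(W2) + E
    if mu > eps:
        return '+'
    if mu < -eps:
        return '-'
    return None


# Synthetic residual: bounded nominal noise, then a +1.0 offset at index 100.
noise = lambda i: 0.1 if i % 2 == 0 else -0.1
residual = [noise(i) for i in range(100)] + [1.0 + noise(i) for i in range(20)]
```

On this synthetic signal the detector stays silent at index 99 and reports a positive deviation once the small window lies entirely after the offset.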

Note that for stuck faults in sensors (recall that these
faults eliminate noise from the signal y(t)), a sensor
that is stuck within nominal ranges will not be detected
by the above method. Hence, an additional detection
test is required for these faults. For sensor y,

\[ \mathit{stuck}_y(t) = \begin{cases} \text{true}, & \sum_{i=1}^{N_s} |y(t) - y(t-i)| = 0 \\ \text{false}, & \text{otherwise,} \end{cases} \]

where N_s is a pre-defined limit. If the past N_s consecutive
samples of y(t) are all the same, then stuck_y(t) =
true. The value of N_s depends on the particular sensor.
For some sensors in ADAPT-Lite, N_s must be
quite large (e.g., N_s = 400), because the sensors normally
repeat the same value for long periods of time.
For the discrete sensors ISH236 and ESH244A, we
effectively set N_s = ∞, because these sensors are
binary-valued and noiseless.
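
The stuck test can be sketched in a few lines. The function name is hypothetical, and the handling of short histories (returning false until N_s earlier samples exist) is an assumption made here, not something the text specifies.

```python
def is_stuck(y, t, N_s):
    """Return True if y[t] equals each of the N_s samples before it.

    Passing N_s = float('inf') disables the test, which mirrors the
    choice made for binary-valued, noiseless sensors.
    """
    if t - N_s < 0:
        return False  # fewer than N_s earlier samples are available
    return all(y[t] == y[t - i] for i in range(1, N_s + 1))


readings = [1, 2, 3, 3, 3, 3]
```

For example, `is_stuck(readings, 5, 3)` holds because the last four readings are identical, while `is_stuck(readings, 5, 4)` does not.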

4.2 Symbol Generation 

Robust methods based on the Z-test are also used for
symbol generation (Daigle et al., 2010). The first symbol
is derived directly from the result of fault detection.
If the measurement residual, r(t), is greater than ε⁺(t)
(or less than ε⁻(t)), we obtain a + (or -).

The second symbol calculated is for the direction of
the slope of the residual. We start with an estimate of
the initial residual value, μ_r0(t_d), at the time of fault
detection, t_d, over a small window W_3. The mean of the
residual slope, μ_rd(t), is computed over a window from
t_d to t, with normalization factor 1/(t - t_d + 1). Using
bounds z⁻ and z⁺, the thresholds ε⁻_rd(t) and ε⁺_rd(t) are
computed from the residual standard deviation σ_r and a
modeling error term E_s.

The - (+) symbol is generated when μ_rd < ε⁻_rd(t)
(μ_rd > ε⁺_rd(t)). The window used to calculate the
slope is increased until the symbol is successfully generated,
or t - t_d becomes larger than a pre-specified
limit, at which point the slope is reported as 0, implying
that the true slope is either zero, or unknown but very
small. If the first and second symbols do not match,
we interpret this as a discontinuity in the signal; otherwise,
a smooth change is assumed.

We also compute a discrete change symbol, which
is used to decide whether a signal has switched between
a nonzero and zero value, which is useful
for distinguishing between parametric and discrete
faults (Daigle et al., 2010). To compute the discrete
change symbol, we do not use the residual, but use the
observed and estimated values of the signal. We compute
the mean of the measured signal, y(t), and the
mean of the estimate, ŷ(t), over a small window, W_c.
We wish to determine whether each signal belongs to
a population with zero mean, and choose the variance
of the population to be the variance of the residual as
a good approximation of the true variance of the zero-mean
distribution. The thresholds are computed from this
variance and a modeling error term E_c. If μ_y(t_d) is outside
its bounds, we say it is nonzero, otherwise we say
it is zero. Similarly, if μ_ŷ(t_d) is outside its bounds, we
say it is nonzero, otherwise we say it is zero. If the
estimate is nonzero and the measurement is zero, we
report Z, and if the estimate is zero and the measurement
is nonzero, we report N; else, we report X.


5 FAULT ISOLATION

We utilize a qualitative diagnosis methodology that
isolates faults based on the transients they cause in sys- 
tem behavior, manifesting as deviations in observed 
measurement values from nominal measurement val- 
ues (Mosterman and Biswas, 1999). The transients are 
abstracted using qualitative + (increase), - (decrease), 
and 0 (no change) values and N (zero to nonzero), Z 
(nonzero to zero), and X (no discrete change) values 
to form fault signatures. Fault signatures represent 
these measurement deviations from nominal behav- 
ior as the immediate (discontinuous) change in magni- 
tude, the first nonzero derivative change, and discrete 
zero/nonzero value changes in the measurement from 
the estimate caused by mode changes. 

In addition to fault signatures, we also capture the 
temporal order of measurement deviations, termed rel- 
ative measurement orderings (Daigle et al., 2007), 
based on the intuition that fault effects will manifest 
in some parts of the system before others. Measure- 
ment orderings are based on analysis of the transfer 
functions from faults to measurements (Daigle et al., 
2007). The combination of fault signatures and mea- 
surement orderings forms qualitative event-based in- 
formation for fault isolation (Daigle et al., 2009). 

Measurement orderings do not allow one to elimi- 
nate a fault hypothesis based on the lack of observing 
a measurement deviation. However, for ADAPT-Lite, 
there are a substantial number of cases where this is 
desirable. When a sensor fault occurs, it will cause a 
deviation in a single measurement only, so, if only one 
measurement deviation has been observed for a sig- 
nificant amount of time after fault detection, we may 
assume that it is a sensor fault and eliminate all can- 
didates that are inconsistent with that assumption. For 
specific measurements, we also expect deviations to 
occur within a certain amount of time. For example, 
faults that affect the fan speed should cause deviations
in ST516 within 30 s of t_d. If ST516 has not deviated
by then, all such faults may be eliminated.

The fault signatures and measurement orderings can
be computed manually or automatically from a system
model. The temporal causal graph (TCG) representation,
derived from the system model, can be used with
a forward propagation algorithm to predict qualitative
effects of faults on measurements and their possible
sequences of deviations (Mosterman and Biswas, 1999;
Daigle, 2008). A TCG captures system variables as
nodes in a graph, and the mathematical relations between
them as edges. Fault parameters appear on
edges, allowing parameter changes to be propagated
over the system variables. Since there
are no mode changes in the considered system, generation
of signatures and orderings is performed offline,
and provided as input to the diagnosis algorithm.

Table 1: Selected Fault Signatures for ADAPT-Lite

Fault | E240 | E265 | IT240 | IT267 | ST516
AC483 Δp > 0 | 0+X | +0X | -0X | -0X | 00X
DC485 Δp > 0 | 0+X | 00X | -0X | 00X | 00X
E240 Δp > 0 | +0X | 00X | 00X | 00X | 00X
E240 m > 0 | 0+X | 00X | 00X | 00X | 00X
E240 μ_Δp > 0 | +0X | 00X | 00X | 00X | 00X
EY260 stuck open | +0X | -0Z | -0Z | -0Z | 0-X
FAN416 underspeed | 0+X | +0X | -0X | -0X | 0-X

Selected fault signatures for ADAPT-Lite are shown 
in Table 1, where the first symbol is the immediate 
change in magnitude, the second is the slope, and the 
third is the discrete change. For example, a positive 
offset in E240 will cause an abrupt increase in the 
E240 residual with no change in slope, and no dis- 
crete change behavior (+0X). No other sensors are af- 
fected (00X). An intermittent offset may also cause 
this initial transient, therefore, fault identification is 
necessary to distinguish these faults. The underspeed 
fault of the fan will cause a smooth increase in bat- 
tery voltage (0+X), an abrupt increase in inverter volt- 
age (+0X), abrupt decreases in battery and inverter 
currents (-0X), and a smooth decrease in fan speed 
(0-X). Measurement orderings may also be derived
for a number of faults. Because of the capacitive
effect of the battery, faults cause changes in currents 
before changes in voltages, except for discrete failures 
which cause voltages to go directly to zero. For fan 
faults, the inverter current is affected before the change 
in fan speed. If we ever see the fan speed deviate first, 
then this allows us to immediately conclude that it is a 
sensor fault in ST516. 
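
The signature-matching part of hypothesis refinement can be sketched as follows, using the entries of Table 1. This is a deliberately simplified illustration: it ignores measurement orderings and the timeout-based eliminations, and it requires an exact match of all three symbols, whereas a practical implementation would need to treat a reported slope of 0 as possibly unknown. The fault labels and the function name are hypothetical.

```python
SENSORS = ("E240", "E265", "IT240", "IT267", "ST516")

# Signatures from Table 1: (magnitude, slope, discrete change) per sensor.
SIGNATURES = {
    "AC483 offset > 0":       ("0+X", "+0X", "-0X", "-0X", "00X"),
    "DC485 offset > 0":       ("0+X", "00X", "-0X", "00X", "00X"),
    "E240 offset > 0":        ("+0X", "00X", "00X", "00X", "00X"),
    "E240 drift > 0":         ("0+X", "00X", "00X", "00X", "00X"),
    "E240 intermittent > 0":  ("+0X", "00X", "00X", "00X", "00X"),
    "EY260 stuck open":       ("+0X", "-0Z", "-0Z", "-0Z", "0-X"),
    "FAN416 underspeed":      ("0+X", "+0X", "-0X", "-0X", "0-X"),
}


def refine(candidates, observed):
    """Keep the candidates whose signatures match every observed deviation.

    observed maps a sensor name to the symbol string generated for it;
    sensors that have not deviated are simply absent from the mapping.
    """
    kept = []
    for fault in candidates:
        predicted = dict(zip(SENSORS, SIGNATURES[fault]))
        if all(predicted[s] == sym for s, sym in observed.items()):
            kept.append(fault)
    return kept


# Candidate generation at detection, then refinement as symbols arrive.
step1 = refine(list(SIGNATURES), {"IT240": "-0X"})
step2 = refine(step1, {"IT240": "-0X", "IT267": "-0X"})
step3 = refine(step2, {"IT240": "-0X", "IT267": "-0X", "ST516": "0-X"})
```

Because a sensor with predicted signature 00X should never deviate, any observed deviation on it eliminates the candidate, which is how the sensor faults in E240 drop out at the first step.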

6 FAULT IDENTIFICATION 


Fault identification is initiated immediately after the
initial set of fault candidates is produced after fault
detection. Each candidate has its own identification routine
that updates its estimate at every time step. When
the identification result is inconsistent with the fault
mode, the fault candidate is eliminated; in this way
identification helps in the isolation step.

Our fault identification procedure is related to (Roychoudhury
et al., 2008; Bregon et al., 2009) in that it
uses submodels for fault identification. However, we
use simpler methods for estimating the fault parameters
in our approach. For faults dealing with R_ac,
R_dc, and R_p, we directly calculate the parameter value
p(t) at each time step t as a function of sensor values
at t, and (except for R_p, in which the goal is to
calculate R_p directly), compute the offset Δp(t) at t as
the difference between p(t) and its nominal value.
For sensor faults, we compute the current offset at each
time step using Δp(t) = y(t) - ŷ(t), where ŷ(t) is the
model-predicted output at time t. In each case, we then
analyze the history of Δp over [t_d, t] to determine the
offset, drift, or intermittent offset parameters at t.

The resistance value R_dc is given by

\[ R_{dc}(t) = \frac{v_B(t)}{i_{dc}(t)}, \]

where for v_B(t) we use E281, and for i_dc(t) we use
IT281. The resistance value R_ac is given by

\[ R_{ac}(t) = \frac{v_{ac}(t)}{\sqrt{i_{ac}(t)^2 - \left( \frac{v_{ac}(t)}{Z_{fan}} \sin\phi \right)^2} - \frac{v_{ac}(t)}{Z_{fan}} \cos\phi}, \]

where for v_ac(t) we use √2·E265, and for i_ac(t) we use
√2·IT267. Recall that φ is the phase offset introduced
by the fan load, and Z_fan is its equivalent impedance.
The nominal values of Z_fan and φ were calculated by
solving the following expression at steady state using
two different values of R_ac and measured values of i_ac
and v_ac:

\[ |i_{ac}| = \left| \frac{v_{ac}}{Z_{fan}} (\cos\phi + j \sin\phi) + \frac{v_{ac}}{R_{ac}} \right|. \]

This equation is derived from the complex impedance
expressions for the fan and R_ac.

To identify the parasitic load R_p, we make two
simplifying assumptions. First, since C_s is relatively
small, the battery voltage reaches a new steady-state
value soon after the fault is injected, and we may omit
C_s from the model. Second, since C_0 is very large, the
voltage v_0 will not change substantially over the duration
of the scenario, and we may assume it is constant
during that period. Given this, the equivalent circuit
simplifies to that shown in Fig. 4. The value of the
parasitic load may then be calculated directly as

\[ R_p(t) = \frac{v_B(t)}{\frac{1}{R_s} \left( v_0 - v_B(t) \right) - i_B(t)}, \quad t > t_d. \]

Here, v_B(t) is provided by E240, and i_B(t) is provided
by the sensor IT240. The voltage v_0 is calculated over
a small portion of data at the beginning of the scenario
(where R_p is guaranteed to be absent) as

\[ v_0 = v_B + i_B R_s, \]

i.e., when R_p is not attached, i_0 = i_B, and IT240 may
be used as i_0. We use the value of R_s estimated during
model identification.

Figure 4: Battery equivalent circuit.

Given a history of Δp values over [t_d, t], we compute
the fault parameters for the given fault mode. For
offset faults, we simply take the mean of Δp(t), and
this provides the offset. For drift faults, we compute
the slope over three different intervals, as shown in
Fig. 5. We calculate

\[ m_1(t) = \frac{\Delta p_2 - \Delta p_1}{\frac{1}{2}(t - t_d)}, \]

\[ m_2(t) = \frac{\Delta p_3 - \Delta p_2}{\frac{1}{2}(t - t_d)}, \]

\[ m_3(t) = \frac{\Delta p_3 - \Delta p_1}{t - t_d}, \]

\[ m(t) = \mathrm{median}(m_1(t), m_2(t), m_3(t)). \]

Figure 5: Identification of drift faults.

For a large t - t_d, the effects of noise are diminished
and accurate estimates may be achieved. Taking the
median further decreases the sensitivity to noise. Other
techniques may be used, such as taking the mean of a
few samples around t_d and around t, and computing the
slope based on those; however, the method we adopt
here has proven effective for the selected case study.

For intermittent offset faults, we utilize a limit l
above which Δp(t) is considered faulty, and below
which it is considered nominal. The limit l is typically
chosen as something within 1-2% of the nominal value
of y(t) or p(t). We step through the signal Δp(t), and
maintain two counters k_n and k_f. Each time we transition
from a nominal value to a faulty value, we increment
k_f, and when we transition from a faulty value
to a nominal value, we increment k_n. In effect, these
two counters keep track of the number of times the
signal was faulty and nominal. For each new nominal
value, we increment a second counter t_n that keeps
track of the total amount of time the signal is nominal.
Similarly, for each new faulty value, we increment a
counter t_f that keeps track of the total amount of time
it is faulty. The fault parameters are then computed from
these quantities: μ_Δp is the mean of Δp(t) over the faulty
samples, μ_f is the mean faulty time t_f / k_f, and μ_n is
the mean nominal time t_n / k_n.

Fault identification is also used to help further isolate
faults. For example, AC483 failing, EY272 becoming
stuck open, and an increase in the AC483
resistance all have the same qualitative fault signatures
and measurement orderings, so cannot be distinguished
based on that information alone. But, for each
of these, we can calculate an equivalent resistance offset.
If the true fault is a failure in AC483 or EY272,
then the R_ac equation yields a large negative value,
but if a resistance offset is the true fault, then the R_ac
equation will yield a reasonable value, allowing us to
differentiate the faults.

Identification is also used to help differentiate between
different fault profiles. Both persistent and intermittent
offsets give the same signatures and orderings,
but if the fault is truly persistent, then the fault
parameters for the intermittent fault mode will have
very small values for μ_n. If μ_n is less than 0.5 s by
60 s past t_d, we can eliminate the intermittent fault
mode as a candidate. Identification may also help correct
isolation mistakes. If a drift is small enough, then
the corresponding slope symbol may be calculated as
0, wrongly identifying offset as the fault mode. But,
we can compute the offset at t_d and at t, and if they
are significantly different from each other (e.g., a 25%
difference), then the fault mode is actually a drift.
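
The drift and intermittent-offset estimators described in this section can be sketched as below. Several details are assumptions made for illustration only: samples are uniformly spaced by dt, the three drift points are taken as the first, middle, and last samples of the history over [t_d, t], the magnitude |Δp| is compared against the limit so that negative offsets are handled, and the first sample of the history opens a new faulty or nominal interval. The function names are hypothetical.

```python
from statistics import mean, median


def drift_slope(delta_p, dt):
    """Median of three slope estimates over the history of offsets."""
    n = len(delta_p) - 1                 # number of sample intervals
    if n < 2:
        raise ValueError("need at least three samples")
    mid = n // 2
    dp1, dp2, dp3 = delta_p[0], delta_p[mid], delta_p[n]
    m1 = (dp2 - dp1) / (mid * dt)        # first half of the window
    m2 = (dp3 - dp2) / ((n - mid) * dt)  # second half of the window
    m3 = (dp3 - dp1) / (n * dt)          # whole window
    return median([m1, m2, m3])


def intermittent_params(delta_p, dt, limit):
    """Return (mean offset, mean faulty time, mean nominal time)."""
    k_f = k_n = 0          # number of faulty / nominal intervals
    t_f = t_n = 0.0        # total faulty / nominal time
    faulty_values = []
    prev_faulty = None
    for dp in delta_p:
        faulty = abs(dp) > limit
        if faulty:
            if prev_faulty is not True:
                k_f += 1   # transition into a faulty interval
            t_f += dt
            faulty_values.append(dp)
        else:
            if prev_faulty is not False:
                k_n += 1   # transition into a nominal interval
            t_n += dt
        prev_faulty = faulty
    mu_dp = mean(faulty_values) if faulty_values else 0.0
    mu_f = t_f / k_f if k_f else 0.0
    mu_n = t_n / k_n if k_n else 0.0
    return mu_dp, mu_f, mu_n


ramp = [0.5 * k for k in range(11)]          # drift with slope 0.5
pulses = [2, 2, 2, 0, 0, 2, 2, 2, 0, 0]      # offset of 2, on 3 s, off 2 s
```

Applied to a persistent offset such as `[2] * 5`, the intermittent estimator returns a mean nominal time of zero, which is the condition used above to discard the intermittent fault mode.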


7 FAULT RECOVERY 

Towards the scenario end, a decision must be made as 
to whether the mission should be aborted or continued. 
The fault identification module computes a candidate
set F, with each f ∈ F being defined by the component,
its fault mode, and the associated fault parameters.
The oracle is viewed as a function O(f) which,
for a given fault, computes a recommended set of commands
C. For DPI, either C = {abort} or C = ∅.

Each command set has an associated cost. The cost 
is zero when the correct command is chosen by the 
decision module. If the algorithm recommends abort 
when the mission should be continued, the associated 
cost is that of the mission. If the algorithm recom- 
mends to continue when it should have been aborted, 
the associated cost is that of the mission and the ve- 
hicle. Therefore, we take the conservative approach 
where: 

\[ C = \begin{cases} \{\text{abort}\}, & \{\text{abort}\} \in \{O(f) : f \in F\} \\ \emptyset, & \text{otherwise,} \end{cases} \]

i.e., if an abort is recommended for at least one f ∈ F,
we recommend abort. This is satisfactory because we
have a high confidence in our diagnosis algorithm.
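
The conservative decision rule amounts to a single pass over the candidate set. In the sketch below the oracle is passed in as a plain function; the toy oracle, the candidate tuples, and all names are hypothetical stand-ins for the oracle supplied by the competition framework.

```python
ABORT = frozenset({"abort"})
CONTINUE = frozenset()     # the empty command set


def decide(candidates, oracle):
    """Recommend abort if the oracle recommends it for any candidate."""
    recommendations = {oracle(f) for f in candidates}
    return ABORT if ABORT in recommendations else CONTINUE


# Toy oracle for illustration only: abort on failures, continue otherwise.
def toy_oracle(fault):
    component, mode = fault
    return ABORT if mode in ("failed", "stuck open") else CONTINUE


ambiguous = [("AC483", "failed"), ("EY272", "stuck open")]
benign = [("E240", "offset")]
```

With an ambiguous candidate set in which every member leads to the same recommendation, the rule returns that recommendation without requiring a unique diagnosis.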

8 EXPERIMENTAL RESULTS 

The results from running QED on the provided fault
scenarios are shown in Tables 2 and 3. The nominal
scenarios are omitted, as no false positives were
detected. The time of fault occurrence is denoted by
t_f, of detection by t_d, and of isolation by t_i. All times
are in seconds. The correct fault was always isolated,
and the fault parameters, in most cases, are close to
the actual values. Unique diagnoses were not obtained
in four cases, where the faults are not actually distinguishable:
(i) AC483 failing and its relay EY272 getting
stuck open, (ii) the fan failing and its relay EY275
failing, (iii) DC485 failing and its relay EY284 becoming
stuck open, and (iv) the inverter failing and the preceding
circuit breaker CB262 failing. In each of these
cases, the lack of a relay or circuit breaker sensor results
in the ambiguity. In these cases, the recommendation
is always the same, so the correct recommendation
was made. For all other cases, the candidate was
uniquely isolated and the fault parameters were identified
with enough precision to also obtain the correct
recommendation.

Table 2: Mean Detection and Isolation Times

Fault Class | Size of Class | t_d - t_f | t_i - t_f
All Faults | 34 | 6.21 | 42.91
Physical Faults | 22 | 5.30 | 36.64
Sensor Faults | 12 | 7.90 | 54.42
Abrupt Faults | 27 | 2.95 | 29.33
Incipient Faults | 7 | 18.81 | 95.30
Persistent Faults | 25 | 8.38 | 50.24
Intermittent Faults | 9 | 0.21 | 22.57

The mean detection and isolation times (in seconds) 
are shown in Table 2. Here, we divide the faults into 
different classes. On average, detection took under 
10 s, and isolation took under 60 s. Physical faults 
could be detected and isolated faster than sensor faults, 
and abrupt faults could be detected and isolated signif- 
icantly faster than incipient faults. Because intermit- 
tent faults were always abrupt, they could be detected 
and isolated faster than the persistent faults, which in- 
cluded drift faults. 

As an illustrative example, we consider a resistance 
offset in AC483 that occurs at 180.4 s with Δp = 15,
shown in Fig. 6. At 180.4 s, a decrease in IT240 is de- 
tected. The initial candidate list contains resistance in- 
creases in AC483, AC483 failing, resistance increases 
in DC485, DC485 failing, each of the circuit breakers 
failing, each of the relays failing, the fan failing or in 
the underspeed mode, the inverter failing, or faults in 
IT240. At 180.4 s, a decrease in IT267 is detected, 
eliminating faults in IT240 and DC485. At 183.4 s, 
QED computes that IT240 did not go to zero, so the 
fault in CB236 is eliminated. At 183.5 s, QED com- 
putes that IT267 has not gone to zero, eliminating the 
remaining circuit breaker faults and the inverter failure 
as candidates. At 191.3 s, QED computes the slope of 
IT240 as 0, eliminating the resistance drift fault. At 
210.3 s faults in the fan and its relay are eliminated be- 
cause deviations in ST516 have not been observed. At 
220 s, AC483 failed and EY272 stuck open are elimi- 
nated because the equivalent resistance offset for these 
faults does not agree with the calculated resistance offset.
Also, the intermittent resistance offset is eliminated
because μ_n was calculated as 0 s. This leaves a
resistance offset of AC483 as the remaining candidate,
and the offset is calculated as 13.9, which differs from
the true value by 7.3%. The corresponding recommendation
is to continue the mission.

9 CONCLUSIONS 

We described our entry into DXC’10, called QED,
which incorporates principles of qualitative event-based
fault isolation. We extended our approach with
fault identification and several heuristics to further
improve fault isolation and identification, based on
knowledge of the system. We found it crucial to also
utilize the results of fault identification to help resolve
further ambiguities in fault isolation. The performance
of the algorithm hinges on correct symbol generation,
but it can be difficult to tune the slope calculation because
it is also used for discontinuity detection. We
believe that a separate reliable method of discontinuity
detection is necessary to alleviate this problem.

Figure 6: Selected measurements for AC483 offset
with Δp = 15.

Although QED was successful on the provided diagnosis
scenarios, problems could arise when it is
applied to the competition data set. Our fault
detectors and symbol generators were tuned for opti- 
mal performance with the provided scenarios, but may 
be too sensitive for some of the competition data, re- 
sulting in false alarms or incorrect symbol generation, 
which may result in incorrect diagnoses, and, conse- 
quently, incorrect recovery recommendations. In many 
places, we make final fault isolation decisions based 
on manually selected quantitative thresholds, so incor- 
rect diagnoses cannot be later corrected using new ev- 
idence (e.g., finding out a stuck sensor is not really 
stuck). This issue may be overcome using nonmono- 
tonic or probabilistic reasoning. The approach is lim- 
ited to single faults, so the capability to handle multi- 
ple faults, of which initial progress has been described 
in (Daigle, 2008), would be needed to apply the ap- 
proach to the full ADAPT system. In order to help 
manage the complexity of the much larger system, dis- 
tributed diagnosis approaches such as those explored 
in (Roychoudhury et al., 2009; Roychoudhury, 2009; 
Daigle et al., 2010) may be useful as well.

Table 3: Diagnosis Results

True Candidate | t_f | t_d | t_i | F | C
AC483 Δp = -21 | 90.2 | 90.2 | 150.2 | AC483 Δp = -21.82 | {abort}
AC483 Δp = 15 | 180.4 | 180.4 | 220 | AC483 Δp = 13.91 | {abort}
AC483 m = -0.1 | 32 | 41.3 | 101.3 | AC483 m = -0.09 | {abort}
AC483 m = 0.071 | 30 | 42.1 | 102.1 | AC483 m = 0.069 | {abort}
AC483 μ_Δp = -21, μ_f = 3.6, μ_n = 19.6 | 29.9 | 30 | 33.2 | AC483 μ_Δp = -20.4, μ_f = 3.61, μ_n = 17.16 | ∅
AC483 μ_Δp = -148, μ_f = 3.8, μ_n = 5.7 | 30.5 | 30.6 | 33.6 | AC483 μ_Δp = -148.7, μ_f = 3.73, μ_n = 5.12 | {abort}
AC483 failed | 50.1 | 50.1 | 110.1 | AC483 failed, EY272 stuck open | {abort}
BAT2 R_p = 6 | 120.8 | 157.4 | 160.7 | BAT2 R_p = 6.0 | ∅
CB266 failed | 61.5 | 61.5 | 64.7 | CB266 failed | {abort}
DC485 Δp = -2.5 | 59.5 | 59.7 | 119.7 | DC485 Δp = -2.56 | {abort}
DC485 Δp = 4.5 | 150.5 | 150.5 | 210.5 | DC485 Δp = 3.85 | {abort}
DC485 m = -0.005 | 35 | 77.6 | 137.6 | DC485 m = -0.0044 | {abort}
DC485 m = 0.021 | 30 | 44.6 | 220 | DC485 m = 0.021 | {abort}
DC485 μ_Δp = -3, μ_f = 3.9, μ_n = 7.7 | 30.5 | 30.7 | 34.2 | DC485 μ_Δp = -2.94, μ_f = 3.73, μ_n = 7.42 | {abort}
DC485 μ_Δp = -2.8, μ_f = 4.1, μ_n = 6.1 | 30.6 | 30.8 | 35.1 | DC485 μ_Δp = -2.96, μ_f = 4.17, μ_n = 5.53 | {abort}
DC485 μ_Δp = -3.2, μ_f = 3.9, μ_n = 13.5 | 30.5 | 30.8 | 34.3 | DC485 μ_Δp = -3.07, μ_f = 3.82, μ_n = 12.1 | ∅
DC485 failed | 70.8 | 70.8 | 75.2 | DC485 failed, EY284 stuck open | {abort}
E240 Δp = -1 | 110 | 110 | 170 | E240 Δp = -0.99 | ∅
E242 m = 0.005 | 75 | 108.3 | 168.3 | E242 m = 0.0055 | ∅
E265 c = 0 | 150 | 150 | 164.6 | E265 c = 0 | ∅
E281 μ_Δp = 0.9, μ_f = 2.7, μ_n = 17.8 | 35 | 35.3 | 39 | E281 μ_Δp = 0.99, μ_f = 3.14, μ_n = 18.56 | ∅
EY244 stuck open | 131.6 | 131.6 | 131.7 | EY244 stuck open | {abort}
FAN416 failed | 80.8 | 80.8 | 84.2 | EY275 stuck open, FAN416 failed | {abort}
FAN416 overspeed | 91 | 91.1 | 96.7 | FAN416 overspeed | {abort}
FAN416 underspeed | 101 | 101.1 | 115.7 | FAN416 underspeed | ∅
INV2 failed | 111 | 111 | 113.4 | CB262 failed, INV2 failed | {abort}
IT240 m = 0.005 | 90 | 106.6 | 166.6 | IT240 m = 0.0047 | {abort}
IT240 μ_Δp = 7.8, μ_f = 3.1, μ_n = 6.2 | (missing) | 35 | 95 | IT240 μ_Δp = 7.77, μ_f = 3.08, μ_n = 5.98 | {abort}
IT267 Δp = -1 | 40 | 40.2 | 100.2 | IT267 Δp = -1 | {abort}
IT267 m = 0.015 | 50 | 53.2 | 113.2 | IT267 m = 0.015 | {abort}
IT267 μ_Δp = -0.3, μ_f = 3.1, μ_n = 15 | 40 | 40.4 | 100.4 | IT267 μ_Δp = -0.3, μ_f = 3.11, μ_n = 14.77 | ∅
IT281 Δp = 0.2 | 120 | 120.9 | 180.9 | IT281 Δp = 0.2 | ∅
IT281 μ_Δp = -1, μ_f = 3, μ_n = 5.2 | 50 | 50.3 | 110.3 | IT281 μ_Δp = -1, μ_f = 3, μ_n = 5.2 | {abort}
ST516 c = 840 | 170 | 209.6 | 209.6 | ST516 c = 840 | {abort}